Received Jan 23, 2022

[...] respiratory failure. The potential relationship between smoking and COVID-19 has been recently
investigated. In this paper, we investigate the role of a decision support system in predicting the
ratio of respiratory failure in smokers versus non-smokers among COVID-19 patients. We employed a
classifier that predicts the ratio of respiratory failure as well as the ratio of the death toll
[...] employed model demonstrate a prediction accuracy of 77% when applied to a sample from 23
countries that confirmed the highest number of COVID-19 patients. This was obtained from The World
Bank Data-Health Nutrition and Population Statistics. As a result, a strong (significant)
relationship between tobacco smoking and COVID-19 was illustrated by the employed model. Our
approach achieves a good recall (78%). Thus, smokers are more susceptible than non-smokers to
respiratory failure as a COVID-19 complication.

Keywords: COVID-19, Decision support system, Machine learning, Respiratory, Smokers

1. INTRODUCTION

Machine learning (ML) is a branch of artificial intelligence (AI) concerned with “teaching”
computers how to process data without the need to explicitly implement every possible scenario [1].
The main idea, in short, is to develop algorithms that are able to learn, by training, on a very
large number of inputs, possibly with known results [2], [3]. Supervised learning is one of the
major types of ML. It fundamentally relies on estimating future instances based on known (current)
instances. The major goal of supervised learning is to discover patterns that relate class labels
to predictor features. These patterns are then used to select class labels for testing instances,
based on the predictor features that are known [4], [5]. Feature selection (FS), also known as
attribute selection, is an essential phase of building any predictive model [6], [7], since the
number of features can be large and some features are less informative than others. Without a
doubt, the COVID-19 pandemic has changed the world’s view on life. It requires the world to make
great efforts, work hand in hand, and use the best available technologies to help predict infection
and control the spread of this real threat [8], [9]. The coronavirus (SARS-CoV-2) is still a
widespread health emergency in the world [10]. Even though the evidence surrounding pharmaceutical
therapy has developed quickly, the lack of well-established preventive methods has made the
efficient triage of COVID-19 patients a difficult process. While prognostic scores, such as the
modified early warning score (MEWS) [11], are very helpful in determining the severity of illness
of a COVID-19 patient, there is a lack of research studies that examine the capability of these
scoring systems when predicting outcomes and mortality rates for COVID-19 infected patients [2].
Also, the discriminatory capability of such rule-based scores has recently been shown to be
limited [10].

From the perspective of epidemiologic predictions, hospitals are expected to face an increased
number of admissions of COVID-19 patients, and patient triage is expected to remain sustainable in
order to support the efficient distribution of minimal resources [13]. Due to the early similarity
of symptoms between patients who are in danger of decompensation and those in need of mechanical
ventilation, physicians monitor patients more aggressively, which in turn narrows the opportunity
to intubate under more controlled conditions. A longer wait before deciding on intubation puts
patients at risk of dangerous complications, including hypotension, peri-intubation hypoxia,
cardiac arrest, and arrhythmia [14].

The COVID-19 pandemic has left most health care systems inadequately prepared and demands modern
methods to tackle this unprecedented public health and clinical emergency. The clinical
presentation of COVID-19 ranges from asymptomatic infection to acute pneumonia, in which
progression to respiratory failure is difficult to predict. Pneumonia generally occurs in the
second or third week of symptomatic infection and is reported to have a death rate of 3% to 10%.
Complications like pneumonia increase the need for mechanical intubation and carry the risk of
multi-organ failure. Generally, patients should report the abrupt onset of dyspnoea during activity
or rest [15], [16]. A respiratory rate greater than 30 breaths per minute, a blood oxygen
saturation less than or equal to 93%, and a ratio of partial pressure of arterial oxygen to
fraction of inspired oxygen (PaO2/FiO2) of less than 300 mmHg are significant clinical signs of
acute respiratory distress syndrome (ARDS), leading to mild up to severe respiratory failure.
Overall, there is a high degree of uncertainty both in the progression of the patient’s condition
and in the speed at which patients develop respiratory failure that demands mechanical ventilation.
ML models, like those used in this study, have shown potential for building predictive models that
can support and improve clinical decision-making for a wide variety of outcomes, and have lately
been used in response to the COVID-19 emergency [17].

While current data science and machine learning technologies have proven to be very useful in
diagnosing patients, tracking the spread of the virus, and speeding up the process of finding an
effective vaccine, health organizations and governments are still struggling to contain the spread
of the COVID-19 virus. Data science and machine learning have so far been among the most effective
techniques in combating the dissemination of COVID-19, and they are the techniques that helped
China curb the spread of the virus in a short time [18], [19]. ML can be used to better understand
risk factors in large and mixed groups of COVID-19 patients, so that an objective, algorithmic
evaluation of these factors can help determine the percentage of respiratory failure in smokers and
non-smokers among patients with COVID-19.

Unfortunately, with the pandemic still in progress, there is limited research regarding the health
status of patients and their risk factors, such as smoking. Due to its detrimental impact on
societies, smoking has been a significant concern for many generations. This study examines the
role of a decision support system for COVID-19 patients by developing a prediction model that
estimates the ratio of respiratory failure in smokers to non-smokers among patients suffering from
COVID-19, in addition to predicting the death toll in those patients. The aim is to highlight the
negative impact of tobacco, which extensive evidence links to a plethora of respiratory diseases.
To the best of our knowledge, there are no studies in the literature showing that smoking increases
the death toll among COVID-19 patients or the severity of disease in smokers.

The current study provides a new machine learning model that aims to predict the ratio of respiratory 
failure in smokers to non-smokers among patients with COVID-19, and the ratio of the death toll in smokers to 
non-smokers. This would help provide care to patients in a system where resources are limited by enabling risk 
recording, based on data from many health care delivery centers, including demographics, laboratory findings, 
and existing diseases. This study tries to address two major research questions: 

— RQ1: Do machine learning approaches support predicting the ratio of respiratory failure between COVID-19
patients who smoke and those who do not?
— RQ2: Could we predict the ratio of the death toll in smokers to non-smokers among COVID-19 patients
through machine learning methods?


To answer these research questions, the study used supervised classifiers, for which the input
corpus is divided into two groups: i) a training group and ii) a testing group. The first group is
used to train the developed machine learner, while the performance of the learner is computed on
the second group. A 10-fold cross-validation technique was used to obtain both the training and
testing sets. The WEKA toolkit [2], [21] was used to perform supervised classification in our work.
Finally, the employed models were developed using a set of classification algorithms that are
widely used in decision support systems for the healthcare industry [23]-[26].
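
As a minimal sketch of how 10-fold cross-validation partitions a corpus into training and testing sets, the following Python fragment may help. The function name `k_fold_indices`, the shuffling seed, and the plain (unstratified) fold assignment are assumptions made for illustration; the fold assignment performed by WEKA may differ (for example, it can stratify folds by class). The corpus size of 250 is the number of instances reported in Table 2.

```python
import random


def k_fold_indices(n_instances, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary, assumed value
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        # Training set: every instance that is not in the held-out fold.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


# 250 instances, as reported for the corpus in Table 2.
splits = list(k_fold_indices(250, k=10))
for fold_number, (train_idx, test_idx) in enumerate(splits, start=1):
    print(fold_number, len(train_idx), len(test_idx))
```

Each instance is used for testing exactly once and for training in the remaining nine folds, which is why the reported metrics are averages over all instances rather than over a single held-out sample.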

This paper is organized as follows. Section 2 reviews related work. Section 3 discusses the
methodology followed in this study. The study results are presented in section 4. Section 5
introduces the main threats to validity, followed by conclusions and some future research
directions in section 6.


2. RELATED WORK 

Burdick et al. aimed to improve machine learning based models for predicting the risk of critical
illness development in COVID-19 patients, and to evaluate how ML risk prediction models may help
care for COVID-19 patients in a clinical setting. A total of 197 patients were enrolled in the
Respiratory decompensation and pattern for the triage of COVID-19 patients: a prospective study
(READY) clinical trial. The results showed that the algorithm had a higher diagnostic odds ratio
(DOR, 12.58) for predicting ventilation than a comparator early warning system, the modified early
warning score (MEWS) [27]. The algorithm also achieved significantly higher sensitivity (0.90) than
MEWS, which attained a sensitivity of 0.78, while preserving a higher specificity (p < 0.05).

Patanavanich and Glantz conducted a meta-analysis of the association between smoking and the
progression of COVID-19. The results indicate that smoking is a risk factor for the progression of
COVID-19, with smokers having higher odds of COVID-19 progression than people who do not smoke.

Ferrari et al. aimed to estimate a 48-hour prognosis of mild to acute respiratory failure requiring
mechanical ventilation in hospitalized patients. A total of 198 patients contributed 1068 usable
observations, which allowed the authors to build three predictive models based, respectively, on 31
variables covering signs and symptoms, 39 laboratory biomarker variables, and 91 variables
combining the two. The last model, the “boosted mixed model”, contains 20 variables chosen from
model 3 and achieved the best predictive performance (AUC=0.84) without worsening the false
negative (FN) rate. Its clinical performance was illustrated in a narrative case report as an
example. The study developed a machine learning model with 84% prognostic accuracy that is suited
to helping clinicians in the decision-making process and contributes to the development of new
analytics that improve care at high technology readiness levels.

The study conducted by Lyu et al. used qualitative and quantitative chest CT indicators to assess
the clinical severity of COVID-19 pneumonia and to characterize the imaging features of critical
cases. A total of 51 patients with COVID-19 pneumonia were enrolled and retrospectively divided
into three groups: normal cases (group A, n=12), severe cases (group B, n=15), and critical cases
(group C, n=24). Qualitative and quantitative indicators of chest CT were recorded and compared
using Fisher’s exact test, one-way ANOVA, the Kruskal-Wallis H test, and receiver operating
characteristic analysis. The results showed that, depending on the severity of the disease, the
number of affected lung segments and lobes and the frequency of consolidation, the crazy-paving
pattern, and the air bronchogram increase in more severe cases. Qualitative indicators, including
the total lung severity score and the overall score for crazy-paving and consolidation, could
distinguish groups B and C from group A (69% sensitivity, 83% specificity, and 73% accuracy) but
were similar between group B and group C. The combined quantitative and qualitative indicators
distinguished these three groups with high sensitivity (B+C versus A, 90%; C versus B, 92%),
specificity (100%, 87%), and accuracy (92%, 90%). Critical cases had a higher overall severity
score (> 10) and a higher overall score for crazy-paving and consolidation (> 4) than normal cases.

Alshirah and Al-Fawa’reh aimed to detect phishing URLs using machine learning with lexical
feature-based analysis. The adopted method detects phishing URLs by analyzing URLs to extract
lexical features and then applying a machine learning method to the extracted features. The dataset
was gathered from different sources and contains four distinct attack scenarios (defacement, spam,
phishing, and malware); however, the emphasis of that research was on phishing URLs. The dataset
was used as input for numerous machine learning and statistical detection models: random forest
(RF), decision tree classifier (DT), Gaussian naive Bayes (GNB), k-nearest neighbor (KNN), logistic
regression, support vector classifier (SVC), quadratic discriminant analysis (QDA), perceptron, and
synthetic minority oversampling technique (SMOTE). These models were employed to predict phishing
URLs based on lexical features. The outcomes point to a comparatively good accuracy rate. The
random forest model produced the best accuracy (98%) compared with the other detection models. In
addition, RF produced the best precision and recall (98% each).


3. METHOD 

The methodology that we followed in our study is presented in this section and shown in Figure 
1. Firstly, we discuss the dataset that is used for our evaluation, specifically the collecting and processing 
processes. Then, we introduce the factors that are used in the learning of the classifier. Lastly, we present the 
developed model and the metrics that are used in the evaluation experiments. 


[Figure 1 flowchart: Collected Dataset -> Data Pre-Processing -> Supervised Classification -> Training / Testing]

Figure 1. The proposed methodology


3.1. Studied dataset 


Our study was conducted on a sample of the 23 countries that confirmed the highest number of
COVID-19 patients during January and February of 2020, taken from the World Bank Data: Health
Nutrition and Population Statistics. In this research, we consider the classification factors shown
in Table 1. The dataset was pre-processed using Microsoft Excel. This data was used as input for
various prediction models based on a statistical model (logistic regression (LR)) and machine
learning models (support vector machine (SVM) and multi-layer perceptron (MLP)). These models were
utilized to predict potential patients of COVID-19 based on their signs and symptoms.


Table 1. Summary of classification factors
No    Classification factors
1     Country
2     TotalSmokingRate
3     MaleSmokingRate
4     FemaleSmokingRate
5     Pop2020
6     COVID19_confirmed
7     COVID19_recovered
8     COVID19_deaths
9     Confirmed to pop
10    Recovered to confirmed
11    Deaths to confirmed
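
The last three factors in Table 1 read as ratios derived from the raw counts. The following minimal sketch assumes that interpretation (each ratio is a simple quotient of two of the listed columns); the function name `derived_ratios`, the dictionary keys, and the example numbers are hypothetical and are not taken from the dataset.

```python
def derived_ratios(pop, confirmed, recovered, deaths):
    """Return the three ratio factors for one country row.

    Assumes each ratio is a plain quotient; zero denominators map to 0.0.
    """
    return {
        "confirmed_to_pop": confirmed / pop if pop else 0.0,
        "recovered_to_confirmed": recovered / confirmed if confirmed else 0.0,
        "deaths_to_confirmed": deaths / confirmed if confirmed else 0.0,
    }


# Hypothetical numbers, for illustration only (not taken from the dataset).
row = derived_ratios(pop=1_000_000, confirmed=2_000, recovered=500, deaths=40)
print(row)
```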

3.2. Creating the corpus

The key step in performing the classification task is generating the corpus that constitutes the
input of the classifiers. Figure 2 shows the major steps of our prediction model. For this work,
the corpus includes the extracted values of every classification factor for each instance of our
studied dataset. Table 2 summarizes the corpus information. We graded the level of failure in
smokers relative to non-smokers from A to F.


Table 2. Summary of corpus information 
Number of instances A B C D E F 
250 10 65 119 22 22 12 
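
The class counts in Table 2 can be turned into class shares and a majority-class reference accuracy, that is, the accuracy of always predicting the most frequent level. The sketch below is an illustration added for orientation only: this majority-class reference is an assumption of the sketch and is not necessarily the baseline the study compares against.

```python
# Class counts per level, as listed in Table 2.
counts = {"A": 10, "B": 65, "C": 119, "D": 22, "E": 22, "F": 12}

total = sum(counts.values())
shares = {level: n / total for level, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent level.
majority_level = max(counts, key=counts.get)
majority_baseline_accuracy = counts[majority_level] / total

print(total, majority_level, round(majority_baseline_accuracy, 3))
```

The counts sum to the 250 instances reported in the table, and the distribution is clearly imbalanced toward level C, which is worth keeping in mind when reading accuracy figures.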


3.3. Classification algorithms 

In our experiments, we employed supervised classifiers, for which the corpus is split into two
groups, a training set and a testing set. The first group is used to train the developed machine
learner, while the performance of the learner is computed on the second group. We used the widely
popular 10-fold cross-validation technique to obtain both the training and testing sets and to get
unbiased results, which offered better model performance on our dataset. The WEKA toolkit is
employed to perform the supervised classification. Various classification algorithms are widely
used in decision support systems for the healthcare domain and have been used to develop the
employed models [32], [33]. These algorithms are as follows:

— SVM: seeks to find a decision boundary between classes that maximizes the margin of the separating line;
one drawback of this approach is that, in its basic form, it applies only to binary classification [34]. SVM
can construct the optimal separating line, which maximizes the distance to the nearest sample data points
[35]. The algorithm raises the dimensionality of the training instances until they become separable in
one of the dimensions. It is very popular since it is efficient in high-dimensional spaces and
thus provides more accurate results [36].


— Random tree: random tree is an ensemble training method for classification. The method is a set of separate
decision trees in which each tree is produced from different samples and subsets of the training data.
Random tree is a supervised learning algorithm that produces many individual learners; it draws a random
set of data for creating each decision tree. Random trees deal with both classification and regression
problems. The classifier takes the input feature vector and classifies it with every tree in the set of
tree predictors (forest). Random tree is an effective data mining algorithm for large amounts of data.
The technique applies several classification trees to a data set and then generates the prediction
from all of the trees combined [37], [38].


— Decision tree: the decision tree, in particular the J48 algorithm, is commonly used to classify different
data sets and produces accurate classification results. J48 is one of the best machine learning algorithms
for examining data categorically and continuously. However, it can consume more memory, which may reduce
performance and accuracy when classifying data. The algorithm creates a binary tree for classification
problems. The approach splits the data into ranges using the values of the attributes found in the
training set [39].


— Naïve Bayes: Naïve Bayes assigns the most probable class to an instance under the assumption that the
classification factors are independent of one another given the class. It is effective in many fields,
such as text categorization and medical diagnosis. It shows great performance in terms of accuracy when
applied in medical domain studies [40].


— SMO: sequential minimal optimization is used for solving the quadratic programming (QP) problem
that arises during the training of support vector machines. SMO is commonly used for training SVMs
because of its high training speed. This approach trains a support vector learner using a polynomial
kernel and converts attributes from nominal to binary values [41].


— Logistic regression: logistic regression is a predictive analysis that estimates the probability of a
dependent variable based on one or more independent variables. Despite its name, logistic regression is a
linear model for classification rather than regression. The approach models the posterior class
probabilities for each of the n classes in the dataset [42].


— K-Star: the main idea of K-Star is to take advantage of an instance-based classifier together with dataset
feature reduction, so that the model can recognize features with a high detection rate and a low false
negative rate. Selecting a good-quality subset of features proves to be significant in enhancing the
performance of the system. Features are filtered to generate the most important feature subset before the
start of the training process. K-Star is a nearest neighbor method that uses an entropy-based distance
measure, computed from the training set, to classify the instances of the testing set [43].


Respiratory failure in COVID-19 patients a comparative study ... (Mohammad Kharabsheh)

1132    ISSN: 2502-4752


— Decision table: a decision table is a prediction method related to decision trees; it is an ordered set of
If-Then rules that can be more efficient and therefore more reasonable than decision trees. We chose to
explore decision tables because they are a simpler, less compute-intensive algorithm than the decision
tree-based approach. The decision table classifier evaluates feature subsets using best-first search and
can use cross-validation for evaluation. The table for a given dataset is generated using
grouping-and-counting in order to classify unknown samples [44].

— K-NN: the k-nearest neighbors algorithm is a non-parametric technique used for classification and
regression. Nearest neighbor is a commonly used text classifier because of its simplicity and
effectiveness. Its learning phase consists of storing all training examples; it is therefore called a lazy
learner, because it defers the decision on how to generalize from the training data until each new
instance is encountered. The technique classifies unknown instances using the previously known instances
closest to them (the nearest neighbors), with the class decided by a voting approach [45].

— IBk: IBk is a nearest-neighbor algorithm that uses distance metrics computed from the training set to find
the closest associated vectors, which are then used to classify data instances of the testing set [46].
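
The nearest-neighbor idea shared by the K-NN and IBk entries above can be sketched in a few lines. This is an illustrative sketch only, assuming Euclidean distance and simple majority voting; the function name `knn_predict`, the two features, and the example rows are hypothetical and do not reproduce the WEKA implementation or the study's data.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training rows."""
    # Pair each training label with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-feature rows (e.g. a smoking rate and a deaths-to-confirmed ratio).
train_X = [(0.10, 0.01), (0.12, 0.02), (0.35, 0.06), (0.40, 0.07)]
train_y = ["B", "B", "D", "D"]

prediction = knn_predict(train_X, train_y, query=(0.38, 0.05), k=3)
print(prediction)
```

Because no model is built before a query arrives, all of the cost is paid at prediction time, which is the sense in which such learners are described as lazy.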


Figure 2. The major steps of our prediction model 


4. RESULTS AND DISCUSSION 
To evaluate the effectiveness of our proposed classification model, we choose to use the following 
metrics: 
— Precision: the ratio of retrieved instances that are truly relevant. It is calculated as (P=true positives/(true
positives+false positives)) [47].
— Recall: the ratio of relevant instances that are retrieved by the classifier; it is computed as (R=true
positives/(true positives+false negatives)) [47].
— F-Measure: a metric that depends on both the recall and the precision of a model and is thus calculated as a
combination of these two metrics, ((2*Recall*Precision)/(Recall+Precision)). The value of this metric is
between 0 and 1 [47].
— Accuracy: a metric that represents the proportion of predictions that are correctly categorized, whether
positive or negative; it is calculated as (A=(true positives+true negatives)/(true positives+true
negatives+false negatives+false positives)) [47].
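
The four formulas above can be computed directly from confusion counts. The following is a minimal sketch for a single class treated as "positive"; the function name `classification_metrics` and the example counts are hypothetical, and the zero-denominator handling is an assumption of the sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F-measure and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f_measure = (
        2 * recall * precision / (recall + precision)
        if (recall + precision)
        else 0.0
    )
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical counts for one class treated as "positive".
m = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(m)
```

For a multi-class problem such as levels A to F, these per-class values are typically averaged across classes; which averaging scheme the reported figures use is not stated in the text.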


We present the results obtained from the classification experiments and use them to answer the
research questions mentioned earlier. The evaluation results show that our approach achieves a
recall of 78%, as shown in Figure 3.


[Figure 3 chart: recall per supervised classifier; recoverable legend entries: RandomTree, J48, Naive Bayes]

Figure 3. Evaluation results of our approach (recall)


RQ1: Do machine learning approaches support predicting the ratio of respiratory failure between
COVID-19 patients who smoke and those who do not? To answer this question, the recall, precision,
and F-measure metrics are used to evaluate the effectiveness of the employed models. Our classifiers
are trained using a combination of all the factors given in Table 1. A comparison between several
classification techniques has been performed, and a comparison with a baseline approach was also
conducted. The performance results of the employed models (classifiers) are shown in Table 3. Our
results show an improvement in the prediction process in terms of all evaluation measures. For
example, compared with the baseline model, our Naïve Bayes classifier reaches 0.78 in terms of
recall and 0.77 in terms of precision. This means that it is possible to build machine learning
models that predict the ratio of respiratory failure in smokers versus non-smokers with good
accuracy.

The second observation is that Naïve Bayes and SMO perform better than the rest of the machine
learning classifiers in terms of accuracy and F-measure. Naïve Bayes calculates a probability for
each class based on the probability distribution in the training dataset. As a result, the
probabilities and priors can be updated dynamically with each training example, which gives
flexibility and robustness to classification errors. On the other hand, the SMO learner achieves a
better F-measure because it increases the dimensionality of the data until the data points are
separable in some dimension. Additionally, the memory needed by SMO is linear in the size of the
training set, which allows SMO to handle very large training sets with higher accuracy.


Table 3. Obtained classification results of the ratio of respiratory failure in smokers to non-smokers. 


Learner Accuracy Recall Precision F-measure 
RandomTree 0.56 0.65 0.53 0.58 
J48 0.59 0.65 0.61 0.63 
NaiveBayes 0.77 0.78 0.77 0.79 
SMO 0.63 0.59 0.59 0.59 
Logistic 0.48 0.61 0.47 0.53 
IBK 0.49 0.47 0.43 0.45 
KStar 0.52 0.47 0.71 0.57 
DecisionTable 0.61 0.6 0.64 0.62 
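
As a consistency check on Table 3, the F-measure can be recomputed from the reported recall and precision using the formula (2*Recall*Precision)/(Recall+Precision). The sketch below copies the values from the table; the variable names and the 0.005 tolerance are assumptions of the sketch.

```python
# (recall, precision, reported F-measure) per learner, copied from Table 3.
table3 = {
    "RandomTree": (0.65, 0.53, 0.58),
    "J48": (0.65, 0.61, 0.63),
    "NaiveBayes": (0.78, 0.77, 0.79),
    "SMO": (0.59, 0.59, 0.59),
    "Logistic": (0.61, 0.47, 0.53),
    "IBK": (0.47, 0.43, 0.45),
    "KStar": (0.47, 0.71, 0.57),
    "DecisionTable": (0.60, 0.64, 0.62),
}


def f_measure(recall, precision):
    """Harmonic mean of recall and precision."""
    return 2 * recall * precision / (recall + precision)


recomputed = {name: round(f_measure(r, p), 2) for name, (r, p, _) in table3.items()}

# Learners whose recomputed value differs from the reported one after rounding.
mismatches = [
    name for name, (_, _, reported) in table3.items()
    if abs(recomputed[name] - reported) > 0.005
]
print(recomputed)
print(mismatches)
```

With these inputs, every row reproduces the reported F-measure to two decimals except NaiveBayes, where the harmonic mean of 0.78 and 0.77 is about 0.77 rather than 0.79. One possible, unverified explanation is that the reported figures are class-weighted averages, for which this identity need not hold exactly.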

RQ2: Could we predict the ratio of the death toll in smokers to non-smokers through machine
learning methods? To answer the second research question, we need to evaluate the usefulness of
each feature separately as a predictor of mortality among smoking and non-smoking patients infected
with COVID-19. To do this, we developed the employed models (classifiers) using decision trees that
were trained using all of the classification factors previously discussed. Using decision trees, we
can rank attributes by their utility in our prediction experiments by performing a top node
analysis on the resulting tree. This analysis measures the presence of each factor under
consideration by examining the structure and levels of the developed decision tree. The tree level
at which an attribute occurs and the number of times it occurs are used to determine the utility
rank of that attribute: the most influential factor is the root node of the generated decision
tree, and a factor’s influence decreases as we move toward the tree’s leaves. Thus, in our study,
we developed a decision tree using the C4.5 algorithm, trained using all the factors researched in
this work. C4.5 is a greedy technique that adds decision nodes at each level of the generated tree
by following a divide-and-conquer strategy over the training set. At each stage of the algorithm,
the information gained from each attribute is computed and the attribute with the highest ranking
is selected; the algorithm runs until a threshold value is reached, which determines the number of
records in the terminal nodes while building the tree. The performance results obtained from our
decision tree classifier are: i) recall: 0.52, ii) precision: 0.49, and iii) F-measure: 0.50.
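
The attribute-selection step described for C4.5 can be illustrated with entropy-based information gain. This is a minimal sketch for nominal attributes only; C4.5 itself normalizes the gain into a gain ratio and also handles continuous attributes, which the sketch omits. The attribute values and class levels below are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a nominal attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical, discretised attribute values and class levels.
total_smoking = ["high", "high", "low", "low"]
female_smoking = ["high", "low", "high", "low"]
levels = ["C", "C", "B", "B"]

gain_total = information_gain(total_smoking, levels)
gain_female = information_gain(female_smoking, levels)
print(gain_total, gain_female)
```

In this toy example the first attribute separates the two levels perfectly and would be chosen as the split, while the second carries no information about the level.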


In addition, our analysis results are given in Table 4. Specifically, for each influential factor,
the table provides the level at which it appears in the created tree (the first column) and the
occurrence frequency associated with the factor (the second column). As we can see, the total
smoking rate represents the root node of our resulting tree and is therefore the most influential
factor in our experiments. That is, the smoking rate is the attribute that best predicts the death
rate among COVID-19 patients. As for the gender of the smoker, the female smoking rate appears,
together with the number of confirmed COVID-19 infections, at level 1 of the tree. Finally, the
proportion of those recovering among those infected with COVID-19 is the most influential factor
at level 2.


Table 4. Outcomes of top node analysis 


Level    Occurrence count    Attribute
0        6                   TotalSmokingRate
         2                   MaleSmokingRate
1        7                   FemaleSmokingRate
         11                  Covid19_Confirmed
2        13                  Covid19_Recovered
         2                   Covid19_Deaths
         3                   Confirmed To Pop
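
A top node analysis of the kind summarized in Table 4 can be sketched as a tree walk that records, for each attribute, the shallowest level at which it occurs and how often it occurs. The tree representation, the function names, and the example tree are hypothetical; how the study aggregates its counts (for example, over one tree or several) is not specified, so this is only one plausible reading.

```python
from collections import defaultdict


def top_node_analysis(tree):
    """Return {attribute: (first_level, occurrence_count)} for a decision tree.

    A tree is either a class label (str, a leaf) or a tuple
    (attribute_name, [subtrees]).
    """
    first_level = {}
    occurrences = defaultdict(int)
    stack = [(tree, 0)]
    while stack:
        node, level = stack.pop()
        if isinstance(node, str):  # leaf: holds a class label, not an attribute
            continue
        attribute, children = node
        occurrences[attribute] += 1
        first_level[attribute] = min(level, first_level.get(attribute, level))
        stack.extend((child, level + 1) for child in children)
    return {a: (first_level[a], occurrences[a]) for a in occurrences}


def rank_attributes(summary):
    """Order attributes by shallowest level, then by higher occurrence count."""
    return sorted(summary, key=lambda a: (summary[a][0], -summary[a][1]))


# Hypothetical tree, for illustration only (not the tree learned in the study).
tree = (
    "TotalSmokingRate",
    [
        ("FemaleSmokingRate", ["B", ("Covid19_Recovered", ["C", "D"])]),
        ("Covid19_Confirmed", [("Covid19_Recovered", ["A", "C"]), "E"]),
    ],
)

summary = top_node_analysis(tree)
ranking = rank_attributes(summary)
print(summary)
print(ranking)
```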


5. THREATS TO VALIDITY 


As with any case study that is based on a sample of smokers and non-smokers, there are potential
threats that prevent us from generalizing our findings to different data sets in various settings.
The data set may not be representative of all samples, so we cannot generalize our results to a
variety of data sets. Moreover, there may be other relevant features that were not used in this
study (for example, smoking sessions, the psychological state of the patient with COVID-19, and the
age of the person). These factors may influence the results we obtained. Our classifiers are based
on successful machine learning techniques that are widely used in the literature. However, each
method has some flaws that may negatively affect the validity of our experiments. Therefore,
developing classifiers using other machine learners will be considered in future work.


6. CONCLUSION AND FUTURE WORK


In this study, we investigated whether machine learning approaches can help predict the ratio of
respiratory failure in smokers to non-smokers among COVID-19 patients and the ratio of the death
toll in smokers to non-smokers. We employed supervised classifiers, for which the input corpus is
split into two groups: a training set and a test set. The first group is used to train the
developed machine learner, while the performance of the learner is computed on the second group. We
used the widely popular 10-fold cross-validation technique to obtain both the training and test
sets, and employed the WEKA toolkit to perform the supervised classification. We also discussed the
various classification algorithms that are widely used in the literature on decision support
systems for the healthcare domain and that were used to develop the employed models (classifiers).
Our results show a best recall of 78% and a worst recall of 47%, and a best precision of 77% and a
worst precision of 43% (Table 3). By performing a top node analysis, we found that the total
smoking rate is the most influential attribute in predicting the death rate among COVID-19
patients. As for the gender of the smoker, the female smoking rate appears at level 1 of the tree,
and the proportion of those recovering among those infected with COVID-19 is the most influential
factor at level 2. In future studies, we aim to explore more classification factors for predicting
the ratio of respiratory failure in smokers to non-smokers among COVID-19 patients using the
machine learning approach, and to predict nicotine dependence, in order to achieve better
prediction performance. We also plan to enrich our study by investigating more varied datasets from
different countries and environments.